Type safety

Type systems

Type safety
Inferred vs. Manifest
Dynamic vs. Static
Strong vs. Weak
Nominal vs. Structural
Dependent typing
Duck typing
Latent typing
Linear typing
Uniqueness typing

In computer science, type safety is the extent to which a programming language discourages or prevents type errors. A type error is erroneous or undesirable program behaviour caused by a discrepancy between differing data types. Type safety is sometimes alternatively considered to be a property of a computer program rather than the language in which that program is written; that is, some languages have type-safe facilities that can be circumvented by programmers who adopt practices that exhibit poor type safety. The formal type-theoretic definition of type safety is considerably stronger than what is understood by most programmers.

Type enforcement can be static, catching potential errors at compile time, or dynamic, associating type information with values at run-time and consulting them as needed to detect imminent errors, or a combination of both.

The behaviors classified as type errors by a given programming language are usually those that result from attempts to perform operations on values that are not of the appropriate data type. This classification is partly based on opinion; some language designers and programmers argue that any operation not leading to program crashes, security flaws or other obvious failures is legitimate and need not be considered an error, while others consider any contravention of the programmer's explicit intent (as communicated via typing annotations) to be erroneous and not "type-safe."

In the context of static (compile-time) type systems, type safety usually involves (among other things) a guarantee that the eventual value of any expression will be a legitimate member of that expression's static type. The precise requirement is more subtle than this — see, for example, subtype and polymorphism for complications.

Type safety is closely linked to memory safety, a restriction on the ability to copy arbitrary bit patterns from one memory location to another. For instance, in an implementation of a language that has some type t, such that some sequence of bits (of the appropriate length) does not represent a legitimate member of t, if that language allows data to be copied into a variable of type t, then it is not type-safe because such an operation might assign a non-t value to that variable. Conversely, if the language is type-unsafe to the extent of allowing an arbitrary integer to be used as a pointer, then it is clearly not memory-safe.

Most statically typed languages provide a degree of type safety that is strictly stronger than memory safety, because their type systems enforce the proper use of abstract data types defined by programmers even when this is not strictly necessary for memory safety or for the prevention of any kind of catastrophic failure.

Contents

Definitions

Type-safe code accesses only the memory locations it is authorized to access. (For this discussion, type safety specifically refers to memory type safety and should not be confused with type safety in a broader respect.) For example, type-safe code cannot read values from another object's private fields.

Robin Milner provided the following slogan to describe type safety:

"Well-typed programs never go wrong."[1]

The appropriate formalization of this slogan depends on the style of formal semantics used for a particular language. In the context of denotational semantics, type safety means that the value of an expression that is well-typed, say with type τ, is a bona fide member of the set corresponding to τ.

In 1994, Andrew Wright and Matthias Felleisen formulated what is now the standard definition and proof technique for type safety in languages defined by operational semantics. Under this approach, type safety is determined by two properties of the semantics of the programming language:

(Type-) preservation
"Well typedness" of programs remains invariant under the transition rules (i.e. evaluation rules or reduction rules) of the language.
Progress
A well typed program never gets "stuck", i.e., never gets into an undefined state where no further transitions are possible.

These properties do not exist in a vacuum; they are linked to the semantics of the programming language they describe, and there is a large space of varied languages that can fit these criteria, since the notion of "well typed" program is part of the static semantics of the programming language and the notion of "getting stuck" (or "going wrong") is a property of its dynamic semantics.

Vijay Saraswat provides the following definition:

"A language is type-safe if the only operations that can be performed on data in the language are those sanctioned by the type of the data."

Relation to other forms of safety

Type safety is ultimately aimed at excluding other problems, eg:-

Type-safe and type-unsafe languages

Type safety is usually a requirement for any toy language proposed in academic programming language research. Many languages, on the other hand, are too big for human-generated type-safety proofs, as they often require checking thousands of cases. Nevertheless, some languages such as Standard ML, which has rigorously defined semantics, have been proved to meet one definition of type safety.[2] Some other languages such as Haskell are believed to meet some definition of type safety, provided certain "escape" features are not used (for example Haskell's unsafePerformIO, used to "escape" from the usual restricted environment in which I/O is possible, circumvents the type system and so can be used to break type safety.[3]) Type punning is another example of such an "escape" feature. Regardless of the properties of the language definition, certain errors may occur at run-time due to bugs in the implementation, or in linked libraries written in other languages; such errors could render a given implementation type unsafe in certain circumstances. An early version of Sun's Java Virtual Machine was vulnerable to this sort of problem.[4]

Type safety and strong typing

Type safety is synonymous with one of the many definitions of strong typing; but type safety and dynamic typing are mutually compatible. A dynamically typed language such as Smalltalk can be seen as a strongly typed language with a very permissive type system where any syntactically correct program is well-typed; as long as its dynamic semantics ensures that no such program ever "goes wrong" in an appropriate sense, it satisfies the definition above and can be called type-safe.

Type safety issues in specific languages

Ada

Ada was designed to be suitable for embedded systems, device drivers and other forms of system programming, but also to encourage type safe programming. To resolve these conflicting goals, Ada confines type-unsafety to a certain set of special constructs whose names usually begin with the string Unchecked_. Unchecked_Deallocation can be effectively banned from a unit of Ada text by applying pragma Pure to this unit. It is expected that programmers will use Unchecked_ constructs very carefully and only when necessary; programs that do not use them are type safe.

The SPARK programming language is a subset of Ada eliminating all its potential ambiguities and insecurities while at the same time adding statically checked contracts to the language features available. SPARK avoids the issues with dangling pointers by disallowing allocation at run time entirely.

C

The C programming language is typesafe in limited contexts; for example, a compile-time error is generated when an attempt is made to convert a pointer to one type of structure to a pointer to another type of structure, unless an explicit cast is used. However, a number of very common operations are non-typesafe; for example, the usual way to print an integer is something like printf("%d", 12), where the %d tells printf at run-time to expect an integer argument. (Something like printf("%s", 12), which erroneously tells the function to expect a pointer to a character-string, will be accepted by compilers, but will produce undefined results.) In addition, C, like Ada, provides unspecified or undefined explicit conversions; and unlike in Ada, idioms that use these conversions are very common, and have helped to give C a type-unsafe reputation. For example, the standard way to allocate memory on the heap is to invoke a memory allocation-function (malloc) with an argument indicating how many bytes are required. The function returns an untyped pointer (type void *), which the calling code must cast to the appropriate pointer type. Expressions such as (struct foo *) malloc(sizeof(struct foo)) are therefore quite common.

C++

C++ is more type-safe than C when used properly (avoiding the use of void pointers and casting between pointers of two types). Features of C++ that promote code that is less at risk of type errors include:

C#

C# is type-safe. It has support for untyped pointers, but this must be accessed using the "unsafe" keyword which can be prohibited at the compiler level. It has inherent support for run-time cast validation. Casts can be validated by using the "as" keyword that will return a null reference if the cast is invalid, or by using a C-style cast that will throw an exception if the cast is invalid. See C Sharp Conversion Operators.

Undue reliance on the object type (from which all other types are derived) runs the risk of defeating the purpose of the C# type system. It is usually better practice to abandon object references in favour of generics, similar to templates in C++ and generics in Java.

Cyclone

Cyclone is a type-safe language. It does not require a virtual machine or garbage collection to achieve type safety during runtime. The syntax is very similar to C.

Standard ML

SML has rigorously defined semantics and is known to be type-safe. However, some implementations of SML, including Standard ML of New Jersey (SML/NJ), its syntactic variant Mythryl and Mlton, provide libraries that offer certain unsafe operations. These facilities are often used in conjunction with those implementations' foreign function interfaces to interact with non-ML code (such as C libraries) that may require data laid out in specific ways. Another example is the SML/NJ interactive toplevel itself, which must use unsafe operations to execute ML code entered by the user.

Pascal

Pascal has had a number of type safety requirements, some of which are kept in some compilers. Where a Pascal compiler dictates "strict typing", two variables cannot be assigned to each other unless they are either compatible (such as conversion of integer to real) or assigned to the identical subtype. For example, if you have the following code fragment:

type
  TwoTypes = record
     I: Integer;
     Q: Real;
  end;
 
  DualTypes = record
    I: Integer;
    Q: Real;
  end;
 
var
  T1, T2:  TwoTypes;
  D1, D2:  DualTypes;

Under strict typing, a variable defined as TwoTypes is not compatible with DualTypes (because they are not identical, even though the components of that user defined type are identical) and an assignment of T1 := D2; is illegal. An assignment of T1 := T2; would be legal because the subtypes they are defined to are identical. However, an assignment such as T1.Q := D1.Q; would be legal.

Examples

The following simple C++ program illustrates that C++ permits operations that are type-unsafe:

#include <iostream>
using namespace std;
 
int main()
{
    int ival = 5;                   // A four-byte integer (on most processors)
    void *pval = &ival;             // Store the address of ival in an untyped pointer
    double dval = *((double*)pval); // Convert it to a double pointer and get the value at that address
    cout << dval << endl;           // Output the double (this will be garbage and not 5!)
    return 0;
}

Even though pval does contain the correct address of ival, when pval is cast from a void pointer to a double pointer the resulting double value is not 5, but an undefined garbage value. On the machine code level, this program has explicitly prevented the processor from performing the correct conversion from a four-byte integer to an eight-byte floating-point value. When the program is run it will output a garbage floating-point value and possibly raise a memory exception. Thus, C++ (and C) allow type-unsafe code.

The next example shows a slightly more complex type safety issue in C++ involving object pointer conversion.

#include <iostream>
using namespace std;
 
class Parent {};
 
class Child1 : public Parent
{
public:
    int a;
};
 
class Child2 : public Parent
{
public:
    double b;
};
 
int main()
{
    Child1 c1;
    c1.a = 5;
    Child2 c2;
    c2.b = 2.4;
    Parent *p = &c1;          // Down-conversion is safe
    Child1 *pc1 = (Child1*)p; // Up-conversion is not safe
    cout << pc1->a << endl;   // ...but this time it's okay
    p = &c2;                  // Safe
    pc1 = (Child1*)p;         // Not safe
    cout << pc1->a << endl;   // This time it breaks!
    return 0;
}

The two child classes have members of different types. When a pointer to a less-specific parent class is converted to a pointer to a more-specific child class, the resulting pointer may or may not point to a valid object of its type. In the first conversion, the pointer is valid, and in the second, it is not. Again, a garbage value is printed and a memory exception may be raised.

See also

References

  1. ^ Milner, Robin (1978), "A Theory of Type Polymorphism in Programming", Jcss 17: 348–375 
  2. ^ http://www.smlnj.org/sml.html
  3. ^ "System.IO.Unsafe". GHC libraries manual: base-3.0.1.0. http://www.haskell.org/ghc/docs/latest/html/libraries/base/System-IO-Unsafe.html#v:unsafePerformIO. Retrieved 2008-07-17. 
  4. ^ Saraswat, Vijay (1997-08-15). "Java is not type-safe". http://www.cis.upenn.edu/~bcpierce/courses/629/papers/Saraswat-javabug.html. Retrieved 2008-10-08.